Document Clustering using Small World Communities

Abstract

Previous research has shown that the words in natural language documents form a small world network. It may therefore be feasible to use extensive physics algorithms for extracting community structure. We present a novel method for semantically clustering a large collection of documents using small world communities. We combine specially modified physics algorithms with traditional information retrieval techniques. A term network is generated from the document collection, the terms are clustered into small world communities, and the semantic term clusters are used to generate overlapping document clusters. Clustering 90K documents took 20 seconds and generated good quality community clusters in nearly linear running time, O(n log n), where n is the size of the lexicon in the document collection.
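
The three stages named in the abstract (term network, term communities, overlapping document clusters) can be illustrated with a minimal Python sketch. It is not the authors' implementation: the abstract does not specify the modified physics algorithms, so connected components of a thresholded co-occurrence graph stand in for community detection here. The function names, the `min_cooccurrence` and `min_shared` thresholds, and the toy documents are all assumptions made for illustration.

```python
from collections import defaultdict
from itertools import combinations


def build_term_network(docs, min_cooccurrence=2):
    """Link two terms when they co-occur in at least min_cooccurrence documents."""
    counts = defaultdict(int)
    for doc in docs:
        terms = sorted(set(doc.lower().split()))
        for a, b in combinations(terms, 2):
            counts[(a, b)] += 1
    graph = defaultdict(set)
    for (a, b), n in counts.items():
        if n >= min_cooccurrence:
            graph[a].add(b)
            graph[b].add(a)
    return graph


def term_communities(graph):
    """Group terms by connected component.

    Assumption: this is a simple stand-in for the community detection
    step, which the abstract does not describe in detail.
    """
    seen, communities = set(), []
    for start in sorted(graph):
        if start in seen:
            continue
        stack, component = [start], set()
        while stack:
            term = stack.pop()
            if term in component:
                continue
            component.add(term)
            stack.extend(graph[term] - component)
        seen |= component
        communities.append(component)
    return communities


def cluster_documents(docs, communities, min_shared=2):
    """Put each document in every community it shares min_shared terms with."""
    clusters = []
    for community in communities:
        members = [i for i, doc in enumerate(docs)
                   if len(community & set(doc.lower().split())) >= min_shared]
        clusters.append(members)
    return clusters


# Hypothetical toy collection; document 3 mixes both vocabularies.
docs = [
    "graph network community",
    "graph network community detection",
    "topic model words",
    "topic model words network community",
]
graph = build_term_network(docs)
communities = term_communities(graph)
clusters = cluster_documents(docs, communities)
print(communities)
print(clusters)
```

In this toy run the last document shares terms with both term communities, so it lands in both document clusters, which is the sense in which the clusters overlap.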

Similar resources

Detecting Overlapping Communities in Social Networks using Deep Learning

In network analysis, a community is typically considered to be a group of nodes with a high density of edges among themselves and a low density of edges to other parts of the network. Detecting community structure is important in any network analysis task, especially for revealing patterns between specified nodes. There is a variety of approaches presented in the literature for overlapping...

Construction of Web Community Directories using Document Clustering and Web Usage Mining

This paper presents the concept of Web Community Directories, as a means of personalizing services on the Web, together with a novel methodology for the construction of these directories by document clustering and usage mining methods. The community models are extracted with the use of the Community Directory Miner, a simple cluster mining algorithm which has been extended to ascend a concept h...

Geographically Organized Small Communities and the Hardness of Clustering Social Networks

Spectral clustering, while perhaps the most efficient heuristic for graph partitioning, has recently gained a bad reputation for failing on large-scale power law graphs. In this chapter we identify the abundance of small-size communities connected by long tentacles as the major obstacle for spectral clustering. These subgraphs hide the higher level structure and result in a highly degenerate...

Optimization of Initial Centroids for K-Means Algorithm Based on Small World Network

The k-means algorithm is a relatively simple and fast clustering algorithm. However, the initial cluster centers of the traditional k-means algorithm are generated randomly from the dataset, and the clustering result is unstable. In this paper, we propose a novel method to optimize the selection of initial centroids for the k-means algorithm based on the small world network. This paper firstl...

A Probabilistic Topic Model Based on Local Word Relationships in Overlapping Windows

A probabilistic topic model assumes that documents are generated through a process involving topics and then tries to reverse this process, given the documents, to extract the topics. A topic is usually assumed to be a distribution over words. LDA is one of the first and most popular topic models introduced so far. In the document generation process assumed by LDA, each document is a distribution o...

Publication date: 2006